Overview

Hex sticker for data science class
This lesson is intended to serve as a reference for early tidyverse users. Instead of focusing on what’s under the hood (i.e., why functions work the way that they do), the focus of this document is to help users determine which functions to use in different scenarios.

My hope is that this document will help limit the time that users have to ask “which function do I use for that?” As such, the material below is arranged by task.

Note: For easier navigation, I recommend opening this document in your browser window rather than the R Studio viewer pane.


Reference material

Below is some handy reference material that you can refer to throughout this lesson. Please click on the menu item to expand it.


Throughout this class, I will refer to the panes (sections) of the R Studio window. This graphic should help you remember them:


The Keyboard shortcuts we will use in this lesson are listed below for Windows and Mac operating systems.


  • Add a pipe operator: Ctrl + Shift + M
  • View all keyboard shortcuts: Ctrl + Alt + K
  • Execute code: Ctrl + Enter
  • Add assignment operator: Alt + dash
  • Add a new code section: Ctrl + shift + R
  • Create a new script file: Command + shift + N
  • Save file: Ctrl + S
  • Copy: Ctrl + C
  • Paste: Ctrl + V
  • Add a pipe operator: Command + Shift + M
  • View all keyboard shortcuts: Command + option + K
  • Execute code: Command + Enter
  • Add assignment operator: Option + dash
  • Add a new code section: Command + shift + R
  • Create a new script file: Command + shift + N
  • Save file: Command + S
  • Copy: Command + C
  • Paste: Command + V

   I strongly suggest using a keyboard shortcut whenever possible. It can be hard to do at first, using your mouse to click a button or typing a few characters is initially way easier than remembering the keyboard shortcut. If you force yourself to do it though, you won’t regret it – it saves a lot of time in the end.


Throughout this lesson, I will use terminology (i.e., jargon) that may be new to you. This menu item can be clicked at any point if you an encounter a term that you do not know or understand. Be sure to visit this menu for each lesson, because new items will be added (Never removed) as the course content expands.

  • Chained code: Code in which a given process is dependent on the outcome of a previous step.
  • Global environment: The location where objects are stored locally during the current session
  • Menu bar: The bar at the top of the RStudio window
  • Nested code: Code in which a function is written inside of another function
  • Non-nested code: Code in which a function is not written inside of another function
  • Pane: The RStudio window is divided into four sections, called panes
  • Partially nested code: Code in a chained process in which a portion of the code is nested and a portion is non-nested
  • Pipe operator: A connector between steps in a chained process in which the information in the previous step is, by default, passed to the first argument of the next step.
  • Session: Everything you do in RStudio in a single sitting
  • Tab: Each pane has a tab on the top of it. For example, I will often refer to the Environment or History tabs of the workspace pane.
  • Working directory: The location on your computer’s operating system where R will attempt to read or write files
  • &nbsp: Any time you see this icon (which is the font-awesome icon user secret), it means that there is a tip that you should pay attention to that will improve your data science workflow.


Throughout this lesson and others there are several bash commands you will have to use in the Terminal tab of your Console pane. These include:

  • mkdir: Make directory
  • ls: List files
  • cd: Change directory
  • cd ..: Change to parent directory
  • mv: Move directory
  • pwd: Print working directory
  • rm: Remove file
  • rm -rf: Remove folder


I strongly recommend downloading and printing the various cheatsheets that have been built by the team at RStudio. They’re super helpful!

Advanced Rt Cheatsheet

Environments, data structures, functions, subsetting, and more are covered in this handy cheatsheet written by Arianne Colton and Sean Chen. Updated Feb 16. Download


Data Import Cheatsheet

The Data Import cheatsheet reminds you how to read in flat files with http://readr.tidyverse.org/, work with the results as tibbles, and reshape messy data with tidyr. Use tidyr to reshape your tables into tidy data, the data format that works the most seamlessly with R and the tidyverse. Updated January 17. Download


Data Transformation Cheatsheet

tidyr provides you with tools to transform untidy data into tidy data in R. This cheatsheet will guide you through the grammar, reminding you how to join, filter, arrange, mutate, summarise, group, and join data frames and tibbles. (Previous version) Updated January 17. Download

Data Transformation Cheatsheet

dplyr provides a grammar for manipulating tables in R. This cheatsheet will guide you through the grammar, giving you a quick reference on how to select, filter, arrange, mutate, summarize, group, and join data frames and tibbles. (Previous version) Updated January 17. Download


Data Visualization Cheatsheet

The ggplot2 package lets you make beautiful and customizable plots of your data. It implements the grammar of graphics, an easy to use system for building plots. See ggplot2.tidyverse.org for more. Updated November 16. Download


RStudio IDE Cheatsheet

The RStudio IDE is the most popular integrated development environment for R. Do you want to write, run, and debug your own R code? Work collaboratively on R projects with version control? Build packages or create documents and apps? No matter what you do with R, the RStudio IDE can help you do it faster. This cheatsheet will guide you through the most useful features of the IDE, as well as the long list of keyboard shortcuts built into the RStudio IDE. Updated January 16. Download



Setup

  1. Open R Studio and a new script file. Remember that it’s best practice to start with a clean R Studio session!

  2. Add a new code section and call it “setup”

  3. After a space between your section break, include and run the following:

library(lubridate)
library(tidyverse)

read_rds('data/raw/district_birds.rds') %>% 
  list2env(.GlobalEnv)

Subset data frame columns

The select() family of functions is used to subset data frame columns.

Let’s take a look at the full captures data frame (Note: You may need to modify the width of your viewer pane to see all of the columns at once):

captures
## # A tibble: 6,835 × 12
##    capture_id visit_id band_…¹ color…² spp   sex   age    wing    tl  mass bp_cp
##    <chr>      <chr>    <chr>   <chr>   <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
##  1 9kIB37f    b1BQ67f… 2261-3… B_WX    SOSP  M     AHY      63    NA  23.9 <NA> 
##  2 9XKm7o1    b1BQ67f… 1991-4… M_XP    GRCA  M     AHY      90    NA  37.4 <NA> 
##  3 6J74507    b1BQ67f… 2261-3… XYG_    SOSP  M     AHY      67    NA  25.9 <NA> 
##  4 7E5T42L    b1BQ67f… 1991-4… XOM_    NOCA  M     AHY      93    NA  42.2 <NA> 
##  5 0551Y4q    76569Hj… 2411-3… X_PK    GRCA  M     AHY      85    NA  33   <NA> 
##  6 FJ72332    76569Hj… 2411-3… ORX_    NOCA  F     AHY      91    NA  45   <NA> 
##  7 BD1A0kZ    76569Hj… 2411-3… BXP_    GRCA  M     AHY      87    NA  45   <NA> 
##  8 136wur8    76569Hj… 2341-2… X_YK    CARW  U     HY       NA    NA  22   <NA> 
##  9 U8A8454    76569Hj… 1202-9… MXO_    AMRO  F     AHY     117    NA  91   <NA> 
## 10 Q15TLAw    76569Hj… 1202-9… WKX_    AMRO  M     AHY     139    NA  75   <NA> 
## # … with 6,825 more rows, 1 more variable: fat <chr>, and abbreviated variable
## #   names ¹​band_number, ²​color_combo

Subset by column name(s)

Single column: To select a single column, we can provide a one-value vector with the name of the column (quotes are not necessary!):

captures %>% 
  select(band_number)
## # A tibble: 6,835 × 1
##    band_number
##    <chr>      
##  1 2261-38348 
##  2 1991-46167 
##  3 2261-38347 
##  4 1991-46166 
##  5 2411-37708 
##  6 2411-37710 
##  7 2411-37709 
##  8 2341-29331 
##  9 1202-99076 
## 10 1202-99075 
## # … with 6,825 more rows

Adjacent columns: We can select a vector of adjacent columns by adding a colon between the first and last column of interest:

captures %>% 
  select(band_number:age)
## # A tibble: 6,835 × 5
##    band_number color_combo spp   sex   age  
##    <chr>       <chr>       <chr> <chr> <chr>
##  1 2261-38348  B_WX        SOSP  M     AHY  
##  2 1991-46167  M_XP        GRCA  M     AHY  
##  3 2261-38347  XYG_        SOSP  M     AHY  
##  4 1991-46166  XOM_        NOCA  M     AHY  
##  5 2411-37708  X_PK        GRCA  M     AHY  
##  6 2411-37710  ORX_        NOCA  F     AHY  
##  7 2411-37709  BXP_        GRCA  M     AHY  
##  8 2341-29331  X_YK        CARW  U     HY   
##  9 1202-99076  MXO_        AMRO  F     AHY  
## 10 1202-99075  WKX_        AMRO  M     AHY  
## # … with 6,825 more rows

Non-adjacent columns: We can select a vector of non-adjacent columns using the combine function c to specify our vector of column names:

captures %>% 
  select(c(band_number, age))
## # A tibble: 6,835 × 2
##    band_number age  
##    <chr>       <chr>
##  1 2261-38348  AHY  
##  2 1991-46167  AHY  
##  3 2261-38347  AHY  
##  4 1991-46166  AHY  
##  5 2411-37708  AHY  
##  6 2411-37710  AHY  
##  7 2411-37709  AHY  
##  8 2341-29331  HY   
##  9 1202-99076  AHY  
## 10 1202-99075  AHY  
## # … with 6,825 more rows

Sets of columns: We can select non-adjacent and adjacent columns in a single select operation:

captures %>% 
  select(
    c(capture_id,
      band_number:age))
## # A tibble: 6,835 × 6
##    capture_id band_number color_combo spp   sex   age  
##    <chr>      <chr>       <chr>       <chr> <chr> <chr>
##  1 9kIB37f    2261-38348  B_WX        SOSP  M     AHY  
##  2 9XKm7o1    1991-46167  M_XP        GRCA  M     AHY  
##  3 6J74507    2261-38347  XYG_        SOSP  M     AHY  
##  4 7E5T42L    1991-46166  XOM_        NOCA  M     AHY  
##  5 0551Y4q    2411-37708  X_PK        GRCA  M     AHY  
##  6 FJ72332    2411-37710  ORX_        NOCA  F     AHY  
##  7 BD1A0kZ    2411-37709  BXP_        GRCA  M     AHY  
##  8 136wur8    2341-29331  X_YK        CARW  U     HY   
##  9 U8A8454    1202-99076  MXO_        AMRO  F     AHY  
## 10 Q15TLAw    1202-99075  WKX_        AMRO  M     AHY  
## # … with 6,825 more rows

… or multiple sets of adjacent columns:

captures %>% 
  select(
    c(capture_id,
      band_number:color_combo,
      wing:mass))
## # A tibble: 6,835 × 6
##    capture_id band_number color_combo  wing    tl  mass
##    <chr>      <chr>       <chr>       <dbl> <dbl> <dbl>
##  1 9kIB37f    2261-38348  B_WX           63    NA  23.9
##  2 9XKm7o1    1991-46167  M_XP           90    NA  37.4
##  3 6J74507    2261-38347  XYG_           67    NA  25.9
##  4 7E5T42L    1991-46166  XOM_           93    NA  42.2
##  5 0551Y4q    2411-37708  X_PK           85    NA  33  
##  6 FJ72332    2411-37710  ORX_           91    NA  45  
##  7 BD1A0kZ    2411-37709  BXP_           87    NA  45  
##  8 136wur8    2341-29331  X_YK           NA    NA  22  
##  9 U8A8454    1202-99076  MXO_          117    NA  91  
## 10 Q15TLAw    1202-99075  WKX_          139    NA  75  
## # … with 6,825 more rows

The above could have also been written as:

captures %>% 
  select(
    c(capture_id,
      band_number,
      color_combo,
      wing:mass))
## # A tibble: 6,835 × 6
##    capture_id band_number color_combo  wing    tl  mass
##    <chr>      <chr>       <chr>       <dbl> <dbl> <dbl>
##  1 9kIB37f    2261-38348  B_WX           63    NA  23.9
##  2 9XKm7o1    1991-46167  M_XP           90    NA  37.4
##  3 6J74507    2261-38347  XYG_           67    NA  25.9
##  4 7E5T42L    1991-46166  XOM_           93    NA  42.2
##  5 0551Y4q    2411-37708  X_PK           85    NA  33  
##  6 FJ72332    2411-37710  ORX_           91    NA  45  
##  7 BD1A0kZ    2411-37709  BXP_           87    NA  45  
##  8 136wur8    2341-29331  X_YK           NA    NA  22  
##  9 U8A8454    1202-99076  MXO_          117    NA  91  
## 10 Q15TLAw    1202-99075  WKX_          139    NA  75  
## # … with 6,825 more rows

Remove columns

Remove a single column: A negated selection will return all columns except for those specified by the negation operator, the !. Here, we’ll remove the column band_number:

captures %>% 
  select(!band_number)
## # A tibble: 6,835 × 11
##    capture_id visit_id   color…¹ spp   sex   age    wing    tl  mass bp_cp fat  
##    <chr>      <chr>      <chr>   <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
##  1 9kIB37f    b1BQ67f46… B_WX    SOSP  M     AHY      63    NA  23.9 <NA>  <NA> 
##  2 9XKm7o1    b1BQ67f46… M_XP    GRCA  M     AHY      90    NA  37.4 <NA>  <NA> 
##  3 6J74507    b1BQ67f46… XYG_    SOSP  M     AHY      67    NA  25.9 <NA>  <NA> 
##  4 7E5T42L    b1BQ67f46… XOM_    NOCA  M     AHY      93    NA  42.2 <NA>  <NA> 
##  5 0551Y4q    76569Hj4a… X_PK    GRCA  M     AHY      85    NA  33   <NA>  <NA> 
##  6 FJ72332    76569Hj4a… ORX_    NOCA  F     AHY      91    NA  45   <NA>  <NA> 
##  7 BD1A0kZ    76569Hj4a… BXP_    GRCA  M     AHY      87    NA  45   <NA>  <NA> 
##  8 136wur8    76569Hj4a… X_YK    CARW  U     HY       NA    NA  22   <NA>  <NA> 
##  9 U8A8454    76569Hj4a… MXO_    AMRO  F     AHY     117    NA  91   <NA>  <NA> 
## 10 Q15TLAw    76569Hj4a… WKX_    AMRO  M     AHY     139    NA  75   <NA>  <NA> 
## # … with 6,825 more rows, and abbreviated variable name ¹​color_combo

Remove sets of columns: The negation operator can be used with multiple columns as well.

We can remove a set of adjacent columns using : to specify a range of columns to remove:

captures %>% 
  select(!band_number:age)
## # A tibble: 6,835 × 7
##    capture_id visit_id         wing    tl  mass bp_cp fat  
##    <chr>      <chr>           <dbl> <dbl> <dbl> <chr> <chr>
##  1 9kIB37f    b1BQ67f46a93580    63    NA  23.9 <NA>  <NA> 
##  2 9XKm7o1    b1BQ67f46a93580    90    NA  37.4 <NA>  <NA> 
##  3 6J74507    b1BQ67f46a93580    67    NA  25.9 <NA>  <NA> 
##  4 7E5T42L    b1BQ67f46a93580    93    NA  42.2 <NA>  <NA> 
##  5 0551Y4q    76569Hj4a1dd400    85    NA  33   <NA>  <NA> 
##  6 FJ72332    76569Hj4a1dd400    91    NA  45   <NA>  <NA> 
##  7 BD1A0kZ    76569Hj4a1dd400    87    NA  45   <NA>  <NA> 
##  8 136wur8    76569Hj4a1dd400    NA    NA  22   <NA>  <NA> 
##  9 U8A8454    76569Hj4a1dd400   117    NA  91   <NA>  <NA> 
## 10 Q15TLAw    76569Hj4a1dd400   139    NA  75   <NA>  <NA> 
## # … with 6,825 more rows

To remove a set of non-adjacent columns, we have to wrap our column names in the combine function, c:

captures %>% 
  select(!c(band_number, age))
## # A tibble: 6,835 × 10
##    capture_id visit_id        color_…¹ spp   sex    wing    tl  mass bp_cp fat  
##    <chr>      <chr>           <chr>    <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
##  1 9kIB37f    b1BQ67f46a93580 B_WX     SOSP  M        63    NA  23.9 <NA>  <NA> 
##  2 9XKm7o1    b1BQ67f46a93580 M_XP     GRCA  M        90    NA  37.4 <NA>  <NA> 
##  3 6J74507    b1BQ67f46a93580 XYG_     SOSP  M        67    NA  25.9 <NA>  <NA> 
##  4 7E5T42L    b1BQ67f46a93580 XOM_     NOCA  M        93    NA  42.2 <NA>  <NA> 
##  5 0551Y4q    76569Hj4a1dd400 X_PK     GRCA  M        85    NA  33   <NA>  <NA> 
##  6 FJ72332    76569Hj4a1dd400 ORX_     NOCA  F        91    NA  45   <NA>  <NA> 
##  7 BD1A0kZ    76569Hj4a1dd400 BXP_     GRCA  M        87    NA  45   <NA>  <NA> 
##  8 136wur8    76569Hj4a1dd400 X_YK     CARW  U        NA    NA  22   <NA>  <NA> 
##  9 U8A8454    76569Hj4a1dd400 MXO_     AMRO  F       117    NA  91   <NA>  <NA> 
## 10 Q15TLAw    76569Hj4a1dd400 WKX_     AMRO  M       139    NA  75   <NA>  <NA> 
## # … with 6,825 more rows, and abbreviated variable name ¹​color_combo

Subset columns by condition

By value class: We can subset columns using a logical test. For example, the below subsets the data to columns where() the column class is numeric:

captures %>% 
  select(
    where(is.numeric))
## # A tibble: 6,835 × 3
##     wing    tl  mass
##    <dbl> <dbl> <dbl>
##  1    63    NA  23.9
##  2    90    NA  37.4
##  3    67    NA  25.9
##  4    93    NA  42.2
##  5    85    NA  33  
##  6    91    NA  45  
##  7    87    NA  45  
##  8    NA    NA  22  
##  9   117    NA  91  
## 10   139    NA  75  
## # … with 6,825 more rows

… and the below subsets the data to columns where() the column class is character:

captures %>% 
  select(
    where(is.character))
## # A tibble: 6,835 × 9
##    capture_id visit_id        band_number color_…¹ spp   sex   age   bp_cp fat  
##    <chr>      <chr>           <chr>       <chr>    <chr> <chr> <chr> <chr> <chr>
##  1 9kIB37f    b1BQ67f46a93580 2261-38348  B_WX     SOSP  M     AHY   <NA>  <NA> 
##  2 9XKm7o1    b1BQ67f46a93580 1991-46167  M_XP     GRCA  M     AHY   <NA>  <NA> 
##  3 6J74507    b1BQ67f46a93580 2261-38347  XYG_     SOSP  M     AHY   <NA>  <NA> 
##  4 7E5T42L    b1BQ67f46a93580 1991-46166  XOM_     NOCA  M     AHY   <NA>  <NA> 
##  5 0551Y4q    76569Hj4a1dd400 2411-37708  X_PK     GRCA  M     AHY   <NA>  <NA> 
##  6 FJ72332    76569Hj4a1dd400 2411-37710  ORX_     NOCA  F     AHY   <NA>  <NA> 
##  7 BD1A0kZ    76569Hj4a1dd400 2411-37709  BXP_     GRCA  M     AHY   <NA>  <NA> 
##  8 136wur8    76569Hj4a1dd400 2341-29331  X_YK     CARW  U     HY    <NA>  <NA> 
##  9 U8A8454    76569Hj4a1dd400 1202-99076  MXO_     AMRO  F     AHY   <NA>  <NA> 
## 10 Q15TLAw    76569Hj4a1dd400 1202-99075  WKX_     AMRO  M     AHY   <NA>  <NA> 
## # … with 6,825 more rows, and abbreviated variable name ¹​color_combo

Partial string match: We can also search for strings in columns names when subsetting using the contains() function. Here, we subset the data to any column that contains the string _id:

captures %>% 
  select(
    contains('_id'))
## # A tibble: 6,835 × 2
##    capture_id visit_id       
##    <chr>      <chr>          
##  1 9kIB37f    b1BQ67f46a93580
##  2 9XKm7o1    b1BQ67f46a93580
##  3 6J74507    b1BQ67f46a93580
##  4 7E5T42L    b1BQ67f46a93580
##  5 0551Y4q    76569Hj4a1dd400
##  6 FJ72332    76569Hj4a1dd400
##  7 BD1A0kZ    76569Hj4a1dd400
##  8 136wur8    76569Hj4a1dd400
##  9 U8A8454    76569Hj4a1dd400
## 10 Q15TLAw    76569Hj4a1dd400
## # … with 6,825 more rows

… and here we use the negation operator to subset to any column that does not contain the string _id:

captures %>% 
  select(
    !contains('_id'))
## # A tibble: 6,835 × 10
##    band_number color_combo spp   sex   age    wing    tl  mass bp_cp fat  
##    <chr>       <chr>       <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
##  1 2261-38348  B_WX        SOSP  M     AHY      63    NA  23.9 <NA>  <NA> 
##  2 1991-46167  M_XP        GRCA  M     AHY      90    NA  37.4 <NA>  <NA> 
##  3 2261-38347  XYG_        SOSP  M     AHY      67    NA  25.9 <NA>  <NA> 
##  4 1991-46166  XOM_        NOCA  M     AHY      93    NA  42.2 <NA>  <NA> 
##  5 2411-37708  X_PK        GRCA  M     AHY      85    NA  33   <NA>  <NA> 
##  6 2411-37710  ORX_        NOCA  F     AHY      91    NA  45   <NA>  <NA> 
##  7 2411-37709  BXP_        GRCA  M     AHY      87    NA  45   <NA>  <NA> 
##  8 2341-29331  X_YK        CARW  U     HY       NA    NA  22   <NA>  <NA> 
##  9 1202-99076  MXO_        AMRO  F     AHY     117    NA  91   <NA>  <NA> 
## 10 1202-99075  WKX_        AMRO  M     AHY     139    NA  75   <NA>  <NA> 
## # … with 6,825 more rows

Changing column names

Using select: During the column selection process, we can change the name of a selected column:

captures %>% 
  select(band_number,
         color_combo,
         species = spp)
## # A tibble: 6,835 × 3
##    band_number color_combo species
##    <chr>       <chr>       <chr>  
##  1 2261-38348  B_WX        SOSP   
##  2 1991-46167  M_XP        GRCA   
##  3 2261-38347  XYG_        SOSP   
##  4 1991-46166  XOM_        NOCA   
##  5 2411-37708  X_PK        GRCA   
##  6 2411-37710  ORX_        NOCA   
##  7 2411-37709  BXP_        GRCA   
##  8 2341-29331  X_YK        CARW   
##  9 1202-99076  MXO_        AMRO   
## 10 1202-99075  WKX_        AMRO   
## # … with 6,825 more rows

Using rename: If we do not need to subset the columns, we may want to choose the rename() function instead:

captures %>% 
  rename(species = spp)
## # A tibble: 6,835 × 12
##    capture…¹ visit…² band_…³ color…⁴ species sex   age    wing    tl  mass bp_cp
##    <chr>     <chr>   <chr>   <chr>   <chr>   <chr> <chr> <dbl> <dbl> <dbl> <chr>
##  1 9kIB37f   b1BQ67… 2261-3… B_WX    SOSP    M     AHY      63    NA  23.9 <NA> 
##  2 9XKm7o1   b1BQ67… 1991-4… M_XP    GRCA    M     AHY      90    NA  37.4 <NA> 
##  3 6J74507   b1BQ67… 2261-3… XYG_    SOSP    M     AHY      67    NA  25.9 <NA> 
##  4 7E5T42L   b1BQ67… 1991-4… XOM_    NOCA    M     AHY      93    NA  42.2 <NA> 
##  5 0551Y4q   76569H… 2411-3… X_PK    GRCA    M     AHY      85    NA  33   <NA> 
##  6 FJ72332   76569H… 2411-3… ORX_    NOCA    F     AHY      91    NA  45   <NA> 
##  7 BD1A0kZ   76569H… 2411-3… BXP_    GRCA    M     AHY      87    NA  45   <NA> 
##  8 136wur8   76569H… 2341-2… X_YK    CARW    U     HY       NA    NA  22   <NA> 
##  9 U8A8454   76569H… 1202-9… MXO_    AMRO    F     AHY     117    NA  91   <NA> 
## 10 Q15TLAw   76569H… 1202-9… WKX_    AMRO    M     AHY     139    NA  75   <NA> 
## # … with 6,825 more rows, 1 more variable: fat <chr>, and abbreviated variable
## #   names ¹​capture_id, ²​visit_id, ³​band_number, ⁴​color_combo
  Now you!

Subset captures to the fields band number, spp, sex, wing, tl (tail length), and mass. Assign the resultant object to your global environment with the key measures.

You can see my answer below by clicking the “Code” button. Give this an honest try before you do though!

measures <-
  captures %>% 
  select(band_number, 
         spp:sex, 
         wing:mass)

   A decision that you’ll sometimes have to make when subsetting is whether to subset by providing the names of your columns of interest or with negation. I typically choose the method that requires the least amount of coding.


Reshape data frames

Being able to re-arrange data frames can be super important when exploring tabular data or preparing the data to be plotted.

Reorder columns

Reorder columns with select: We can use the select() function to reorder columns while we select them (for measures see “Now you!”, above):

measures %>% 
  select(spp, 
         band_number:mass)
## # A tibble: 6,835 × 6
##    spp   band_number sex    wing    tl  mass
##    <chr> <chr>       <chr> <dbl> <dbl> <dbl>
##  1 SOSP  2261-38348  M        63    NA  23.9
##  2 GRCA  1991-46167  M        90    NA  37.4
##  3 SOSP  2261-38347  M        67    NA  25.9
##  4 NOCA  1991-46166  M        93    NA  42.2
##  5 GRCA  2411-37708  M        85    NA  33  
##  6 NOCA  2411-37710  F        91    NA  45  
##  7 GRCA  2411-37709  M        87    NA  45  
##  8 CARW  2341-29331  U        NA    NA  22  
##  9 AMRO  1202-99076  F       117    NA  91  
## 10 AMRO  1202-99075  M       139    NA  75  
## # … with 6,825 more rows

If we have a lot of columns, or want to make our code more generalize-able, we can use the everything() function. The code below says to select spp and then every other column after that.

measures %>% 
  select(spp, everything())
## # A tibble: 6,835 × 6
##    spp   band_number sex    wing    tl  mass
##    <chr> <chr>       <chr> <dbl> <dbl> <dbl>
##  1 SOSP  2261-38348  M        63    NA  23.9
##  2 GRCA  1991-46167  M        90    NA  37.4
##  3 SOSP  2261-38347  M        67    NA  25.9
##  4 NOCA  1991-46166  M        93    NA  42.2
##  5 GRCA  2411-37708  M        85    NA  33  
##  6 NOCA  2411-37710  F        91    NA  45  
##  7 GRCA  2411-37709  M        87    NA  45  
##  8 CARW  2341-29331  U        NA    NA  22  
##  9 AMRO  1202-99076  F       117    NA  91  
## 10 AMRO  1202-99075  M       139    NA  75  
## # … with 6,825 more rows

Move the location of a column: We can also just move a column to the front using the relocate() function:

measures %>% 
  relocate(spp)
## # A tibble: 6,835 × 6
##    spp   band_number sex    wing    tl  mass
##    <chr> <chr>       <chr> <dbl> <dbl> <dbl>
##  1 SOSP  2261-38348  M        63    NA  23.9
##  2 GRCA  1991-46167  M        90    NA  37.4
##  3 SOSP  2261-38347  M        67    NA  25.9
##  4 NOCA  1991-46166  M        93    NA  42.2
##  5 GRCA  2411-37708  M        85    NA  33  
##  6 NOCA  2411-37710  F        91    NA  45  
##  7 GRCA  2411-37709  M        87    NA  45  
##  8 CARW  2341-29331  U        NA    NA  22  
##  9 AMRO  1202-99076  F       117    NA  91  
## 10 AMRO  1202-99075  M       139    NA  75  
## # … with 6,825 more rows

Arrange data frames by row

Arranging data frame by row, using the arrange() function can provide a powerful tool for data frame exploration.

Arrange from lowest to highest values: By default, arrange() sorts a data frame from the lowest to highest value in a specified column. The following arranges measures from lowest to highest mass measurement:

measures %>%
  arrange(mass)
## # A tibble: 6,835 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 2750-41582  CACH  U      62      56   6.3
##  2 2460-22927  CACH  F      60      NA   7  
##  3 2750-44514  HOWR  U      26.5    NA   7.1
##  4 2810-03340  HOWR  M      49.5    40   7.1
##  5 2430-29237  CACH  U      63      NA   7.2
##  6 2250-24271  CACH  F      NA      NA   7.3
##  7 2750-41580  HOWR  M      51      44   7.4
##  8 2750-41591  CACH  U      58      56   7.5
##  9 2810-03341  HOWR  F      47      44   7.5
## 10 2590-82234  CACH  U      58      NA   7.6
## # … with 6,825 more rows

Arrange from highest to lowest values: We can arrange a data frame in descending order using the desc() function nested inside of arrange():

measures %>% 
  arrange(
    desc(mass))
## # A tibble: 6,835 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 1352-46513  AMRO  F       125   NA  111. 
##  2 1573-45513  COGR  F       125  106. 110. 
##  3 1573-45514  COGR  F       125  111  107. 
##  4 1342-72973  AMRO  M       124   NA  102  
##  5 1272-43002  AMRO  F       121   NA  101  
##  6 1352-46575  AMRO  F       121   93   98.7
##  7 1272-43091  AMRO  F       127   NA   97.8
##  8 1352-46269  AMRO  F       125   NA   96  
##  9 1232-94730  AMRO  F       121   NA   95.1
## 10 1232-94720  AMRO  M       128   NA   94.5
## # … with 6,825 more rows

Pivoting: wide to long form

We often want to convert data frames from wide to long format. This is often an important operation in data tidying (because many data frames built in Excel are an untidy wide format). It also can be useful when we want to plot several variables in a single plot (e.g., with variables mapped to different facts of a plot or a color aesthetic mapped to a given variable).

To do so, we use pivot_longer. We supply the data frame we are pivoting (here that data frame is piped in), the selection of columns to pivot, the name for the variable classes (from the column names), and the name of the value column.

measures %>% 
  pivot_longer(
    wing:mass,
    names_to = 'measurement',
    values_to = 'value')
## # A tibble: 20,505 × 5
##    band_number spp   sex   measurement value
##    <chr>       <chr> <chr> <chr>       <dbl>
##  1 2261-38348  SOSP  M     wing         63  
##  2 2261-38348  SOSP  M     tl           NA  
##  3 2261-38348  SOSP  M     mass         23.9
##  4 1991-46167  GRCA  M     wing         90  
##  5 1991-46167  GRCA  M     tl           NA  
##  6 1991-46167  GRCA  M     mass         37.4
##  7 2261-38347  SOSP  M     wing         67  
##  8 2261-38347  SOSP  M     tl           NA  
##  9 2261-38347  SOSP  M     mass         25.9
## 10 1991-46166  NOCA  M     wing         93  
## # … with 20,495 more rows

Pivoting: long to wide form

We can also pivot to a wider data frame. This is sometimes useful in data tidying (when multiple variables are stored in one column), but is most often used when preparing data for analysis or when you need to visually explore tabular data.

To pivot a data frame to a wider format we use the function pivot_wider(). Here, we supply the data frame, the name of the column that contains our variables of interest.

We’re going to use the counts data frame for this (the above has a problem that we’ll address in the next segment) to create a wide frame table of counts by visit.

counts %>% 
  select(-count_id) %>% 
  pivot_wider(
    names_from = spp, 
    values_from = count) %>% 
  select(visit_id:NOCA)
## # A tibble: 316 × 5
##    visit_id         AMRO  GRCA  MODO  NOCA
##    <chr>           <dbl> <dbl> <dbl> <dbl>
##  1 3Fxh65U4fb43f80     2     2     1     2
##  2 8iXAa9851afd100     1     1     1     4
##  3 150jN34577eed00     1    NA    NA     4
##  4 0n075TL4ff23600     4    NA     2     2
##  5 G686H0051edc780     3     1     1     4
##  6 jrS1z12576dc980     1    NA    NA     1
##  7 0dg6s68592e0780     6     1    NA     2
##  8 q1fkjW85b15d280     5    NA    NA    NA
##  9 24tsYn85ce33f80     3    NA    NA     8
## 10 Nbq2mK25ed6e800     9     1    NA     2
## # … with 306 more rows

Note: I included the select() function at the end just so this didn’t make a mess when it printed.

This method can be used to quickly format data for many common community analyses. Of course, all of those NA values in there should really be zeros (the species was not observed during the visit). We can add the values_fill argument to pivot_wider() to replace those values during pivoting.

counts %>% 
  select(-count_id) %>% 
  pivot_wider(
    names_from = spp, 
    values_from = count,
    values_fill = 0) %>% 
  select(visit_id:NOCA)
## # A tibble: 316 × 5
##    visit_id         AMRO  GRCA  MODO  NOCA
##    <chr>           <dbl> <dbl> <dbl> <dbl>
##  1 3Fxh65U4fb43f80     2     2     1     2
##  2 8iXAa9851afd100     1     1     1     4
##  3 150jN34577eed00     1     0     0     4
##  4 0n075TL4ff23600     4     0     2     2
##  5 G686H0051edc780     3     1     1     4
##  6 jrS1z12576dc980     1     0     0     1
##  7 0dg6s68592e0780     6     1     0     2
##  8 q1fkjW85b15d280     5     0     0     0
##  9 24tsYn85ce33f80     3     0     0     8
## 10 Nbq2mK25ed6e800     9     1     0     2
## # … with 306 more rows

Subset by row position

The most straightforward (and sometimes useful) way to subset the rows of a data frame is to extract rows by position.

Slice a row by position: Using the function slice(), we simply provide the row number that we would like to retrieve:

measures %>% 
  slice(1)
## # A tibble: 1 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348  SOSP  M        63    NA  23.9

Slice adjacent rows by position: To slice adjacent rows, we provide a vector of row numbers, with lower and upper values separated with a : operator:

measures %>% 
  slice(1:5)
## # A tibble: 5 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348  SOSP  M        63    NA  23.9
## 2 1991-46167  GRCA  M        90    NA  37.4
## 3 2261-38347  SOSP  M        67    NA  25.9
## 4 1991-46166  NOCA  M        93    NA  42.2
## 5 2411-37708  GRCA  M        85    NA  33

Slice non-adjacent rows by position: For non-adjacent rows, we supply our vector of rows using the combine function, c():

measures %>% 
  slice(
    c(2, 3, 5))
## # A tibble: 3 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 1991-46167  GRCA  M        90    NA  37.4
## 2 2261-38347  SOSP  M        67    NA  25.9
## 3 2411-37708  GRCA  M        85    NA  33

Subset the data to the top row(s): The function slice_head() can be a handy tool for just grabbing the top row:

measures %>% 
  slice_head()
## # A tibble: 1 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348  SOSP  M        63    NA  23.9

… or a specified number of rows from the top of the data frame:

measures %>% 
  slice_head(n = 5)
## # A tibble: 5 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348  SOSP  M        63    NA  23.9
## 2 1991-46167  GRCA  M        90    NA  37.4
## 3 2261-38347  SOSP  M        67    NA  25.9
## 4 1991-46166  NOCA  M        93    NA  42.2
## 5 2411-37708  GRCA  M        85    NA  33

Subset the data to the last row(s): The opposite of slice_head() is slice_tail(), which can be used to grab the last row in a data frame:

measures %>% 
  slice_tail()
## # A tibble: 1 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-44542  CACH  U        59    49   9.1

… or the last several rows in a data frame:

measures %>% 
  slice_tail(n = 5)
## # A tibble: 5 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-44538  CACH  U      57    48     8.3
## 2 2750-44541  CACH  U      60.5  52.5   9.2
## 3 2750-44543  CACH  U      59    49    NA  
## 4 2750-44540  CACH  F      59.5  50     9.3
## 5 2750-44542  CACH  U      59    49     9.1

Slice and arrange: The slice() functions becomes a very powerful tool for data exploration and wrangling when used in conjunction with arrange(). For example, perhaps we want to see the five lowest mass values in the measures data:

measures %>% 
  arrange(mass) %>% 
  slice_head(n = 5)
## # A tibble: 5 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-41582  CACH  U      62      56   6.3
## 2 2460-22927  CACH  F      60      NA   7  
## 3 2750-44514  HOWR  U      26.5    NA   7.1
## 4 2810-03340  HOWR  M      49.5    40   7.1
## 5 2430-29237  CACH  U      63      NA   7.2

Grouped row subsets

When you group a data frame, using the function group_by(), you split the data by a variable:

measures %>%
  group_by(spp)
## # A tibble: 6,835 × 6
## # Groups:   spp [23]
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 2261-38348  SOSP  M        63    NA  23.9
##  2 1991-46167  GRCA  M        90    NA  37.4
##  3 2261-38347  SOSP  M        67    NA  25.9
##  4 1991-46166  NOCA  M        93    NA  42.2
##  5 2411-37708  GRCA  M        85    NA  33  
##  6 2411-37710  NOCA  F        91    NA  45  
##  7 2411-37709  GRCA  M        87    NA  45  
##  8 2341-29331  CARW  U        NA    NA  22  
##  9 1202-99076  AMRO  F       117    NA  91  
## 10 1202-99075  AMRO  M       139    NA  75  
## # … with 6,825 more rows

It doesn’t look like anything happened but it did. Notice that, just above the tibble, you can see that there are now groups (Groups: spp [23]).

Each function run after the group_by() function will run on each group separately. This can be very useful for subsetting data frames.

For example, we can extract the three lowest mass values for each species with:

measures %>%
  arrange(mass) %>% 
  group_by(spp) %>% 
  slice_head(n = 2)
## # A tibble: 41 × 6
## # Groups:   spp [23]
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 2660-67607  AMGO  F      66    45    11.3
##  2 2750-44551  AMGO  M      70    46    12.2
##  3 0962-47742  AMRO  U      NA    NA    38.8
##  4 0742-29601  AMRO  U     105    NA    45.4
##  5 2711-77728  BHCO  U      40.5  14    23  
##  6 2711-77729  BHCO  U      40.4  15.6  24  
##  7 1232-45256  BLJA  U     131   133    89.2
##  8 2750-41582  CACH  U      62    56     6.3
##  9 2460-22927  CACH  F      60    NA     7  
## 10 2571-90110  CARW  U      56    NA     8  
## # … with 31 more rows

   Whenever you use group_by(), you should always ungroup the data afterwards or else all subsequent functions will run on the grouped data!

measures %>%
  arrange(mass) %>% 
  group_by(spp) %>% 
  slice_head(n = 3) %>% 
  ungroup()
## # A tibble: 57 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 2660-67607  AMGO  F      66    45    11.3
##  2 2750-44551  AMGO  M      70    46    12.2
##  3 2750-44550  AMGO  F      68.5  44    12.7
##  4 0962-47742  AMRO  U      NA    NA    38.8
##  5 0742-29601  AMRO  U     105    NA    45.4
##  6 0742-29427  AMRO  U     103    NA    48.9
##  7 2711-77728  BHCO  U      40.5  14    23  
##  8 2711-77729  BHCO  U      40.4  15.6  24  
##  9 2661-31502  BHCO  F      92    61    35.3
## 10 1232-45256  BLJA  U     131   133    89.2
## # … with 47 more rows
  Now you!

Individuals may have been captured and measured more than once. The band number is a unique identifier for each individual. Subset the data to the first record for each individual. Assign to your global environment with the name first_measures.

first_measures <-
  measures %>% 
  group_by(band_number) %>% 
  slice_head() %>% 
  ungroup()

Subset rows by condition

We can subset rows based on whether values satisfy a given condition. This is called filtering and is achieved using the filter() function.

Filter by one variable: To filter based on condition, we provide the name of the target column, a relational operator (see ?Comparison), and the condition upon which to filter the data.

For example, in the below, I filter first_measures (see “Now you!”, above) to rows where the wing value is greater than 80 mm:

first_measures %>% 
  filter(wing > 80) %>% 
  arrange(wing)
## # A tibble: 2,950 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 1141-05732  TUTI  F      80.5  69.5  20.4
##  2 0791-17253  GRCA  M      81    NA    45.2
##  3 2301-56014  GRCA  F      81    NA    35  
##  4 2421-04527  NOCA  M      81    88    36.2
##  5 2421-11347  NOCA  U      81    NA    37.3
##  6 2711-77545  TUTI  F      81    84    21.2
##  7 2711-77571  TUTI  U      81    75    24.6
##  8 2711-77573  TUTI  U      81    75    22.5
##  9 2711-77582  TUTI  U      81    68.5  21.7
## 10 2711-77733  TUTI  M      81    72.5  20.9
## # … with 2,940 more rows

… and here to where the value is greater than or equal to 80 mm:

first_measures %>% 
  filter(wing >= 80) %>% 
  arrange(wing)
## # A tibble: 2,962 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 1141-05731  TUTI  M        80    70  21.7
##  2 1141-05736  TUTI  F        80    73  19.5
##  3 1232-87932  AMRO  U        80    NA  50.2
##  4 2301-56561  NOCA  F        80    NA  NA  
##  5 2301-56929  GRCA  M        80    NA  31.5
##  6 2341-03957  TUTI  F        80    70  22.3
##  7 2411-37737  NOCA  U        80    NA  41  
##  8 2641-63973  NOCA  U        80    NA  39  
##  9 2711-77543  TUTI  F        80    69  19.9
## 10 2711-77583  TUTI  U        80    70  21.1
## # … with 2,952 more rows

We can also filter based on a character value. Below, I filter to where the value in the species column (spp) is “GRCA” (Gray catbird):

first_measures %>% 
  filter(spp == 'GRCA')
## # A tibble: 1,266 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17029  GRCA  M      90.5  99    34.7
##  2 0791-17030  GRCA  M      90    95    33.4
##  3 0791-17032  GRCA  M      93.5 101    37.4
##  4 0791-17033  GRCA  M      NA    NA    36.4
##  5 0791-17045  GRCA  M      90    94    35  
##  6 0791-17048  GRCA  F      90.5 102    42.7
##  7 0791-17057  GRCA  M      92.5  97.5  38.9
##  8 0791-17058  GRCA  M      95    99.5  37.2
##  9 0791-17068  GRCA  M      87    93    38.5
## 10 0791-17074  GRCA  M      94.5 100    38.9
## # … with 1,256 more rows

   A word of warning! When you filter a data frame based on a condition, all NA values will be dropped! That can lead to unexpected behavior if you’re not careful.

Filter with statistical summaries

It’s often useful to supply a derived summary statistic as a condition. For example, below I filter first_measures to where the value of the wing column is equal to the maximum wing length value:

first_measures %>% 
  filter(
    wing == 
      max(wing,
          na.rm = TRUE))
## # A tibble: 1 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 1202-99075  AMRO  M       139    NA    75

… and here to where the wing value is equal to the minimum wing length value:

first_measures %>% 
  filter(
    wing == 
      min(wing,
          na.rm = TRUE))
## # A tibble: 2 × 6
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-64434  HOWR  U      21.5    NA   8.2
## 2 2750-64435  HOWR  U      21.5    NA   8.6

   It is often necessary to include na.rm = TRUE when conducting statistical summaries. Otherwise, summaries of columns that contain missing values will return a value of NA.

Logical negation

Negation is a powerful tool for filtering. Using the != operator, we can filter values that are NOT equal to the specified condition:

first_measures %>% 
  filter(spp != 'GRCA')
## # A tibble: 5,113 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29174  AMRO  U        NA    NA  NA  
##  2 0742-29175  AMRO  U        NA    NA  NA  
##  3 0742-29176  AMRO  U        NA    NA  NA  
##  4 0742-29177  AMRO  U        NA    NA  NA  
##  5 0742-29178  AMRO  U        NA    NA  NA  
##  6 0742-29179  AMRO  U        NA    NA  NA  
##  7 0742-29420  AMRO  M       126    NA  73.4
##  8 0742-29421  AMRO  M       129    NA  70.2
##  9 0742-29422  AMRO  F       119    NA  81.6
## 10 0742-29423  AMRO  F       122    NA  77.4
## # … with 5,103 more rows

We can also use the logical negation operator itself, !, to negate a filtering statement:

first_measures %>% 
  filter(!spp == 'GRCA')
## # A tibble: 5,113 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29174  AMRO  U        NA    NA  NA  
##  2 0742-29175  AMRO  U        NA    NA  NA  
##  3 0742-29176  AMRO  U        NA    NA  NA  
##  4 0742-29177  AMRO  U        NA    NA  NA  
##  5 0742-29178  AMRO  U        NA    NA  NA  
##  6 0742-29179  AMRO  U        NA    NA  NA  
##  7 0742-29420  AMRO  M       126    NA  73.4
##  8 0742-29421  AMRO  M       129    NA  70.2
##  9 0742-29422  AMRO  F       119    NA  81.6
## 10 0742-29423  AMRO  F       122    NA  77.4
## # … with 5,103 more rows

Filter NA values

We sometimes want to remove NA values from our data. We can do this with the is.na() function. This function is a logical test of whether a given value is NA. Since a logical test returns a value or TRUE or FALSE, it is unnecessary to supply relational operators.

Here, I’ll filter first_measures to where tl (tail length) values are NA:

first_measures %>% 
  filter(is.na(tl))
## # A tibble: 4,392 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29174  AMRO  U        NA    NA  NA  
##  2 0742-29175  AMRO  U        NA    NA  NA  
##  3 0742-29176  AMRO  U        NA    NA  NA  
##  4 0742-29177  AMRO  U        NA    NA  NA  
##  5 0742-29178  AMRO  U        NA    NA  NA  
##  6 0742-29179  AMRO  U        NA    NA  NA  
##  7 0742-29420  AMRO  M       126    NA  73.4
##  8 0742-29421  AMRO  M       129    NA  70.2
##  9 0742-29422  AMRO  F       119    NA  81.6
## 10 0742-29423  AMRO  F       122    NA  77.4
## # … with 4,382 more rows

We can filter to values that are not NA using the negation operator, !:

first_measures %>% 
  filter(!is.na(tl))
## # A tibble: 1,987 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29438  AMRO  M     130   100    71.4
##  2 0742-29489  AMRO  F     126   103    81.9
##  3 0742-29490  AMRO  M     128   100    73.7
##  4 0742-29702  AMRO  F     117    88    72.8
##  5 0791-17014  NOCA  M      91.5  96.5  41  
##  6 0791-17018  NOCA  M      99.5 105    45.4
##  7 0791-17019  NOCA  F      91.5  92    41.3
##  8 0791-17020  NOCA  M      90    23.9  41.7
##  9 0791-17021  NOMO  F     103   111    54.1
## 10 0791-17025  NOCA  M      91    96    43.2
## # … with 1,977 more rows

Filter by multiple conditions

AND: When we want to filter based on whether a given value or values match multiple conditions, we can chain together sets of filtering statements:

first_measures %>% 
  filter(wing > 80) %>% 
  filter(wing < 90)
## # A tibble: 1,086 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17068  GRCA  M      87    93    38.5
##  2 0791-17078  GRCA  M      89    95.5  37.3
##  3 0791-17211  GRCA  F      86    NA    40.8
##  4 0791-17212  GRCA  M      83    NA    34.6
##  5 0791-17214  GRCA  F      89.5  NA    33.3
##  6 0791-17222  GRCA  F      88.5  NA    32.9
##  7 0791-17223  GRCA  M      86    NA    37.9
##  8 0791-17239  GRCA  F      89    NA    43.3
##  9 0791-17248  GRCA  M      88    NA    37.5
## 10 0791-17249  GRCA  F      88    NA    40.3
## # … with 1,076 more rows

… or much more parsimoniously, separate filtering statements with the logical operator & (see ?base::Logic):

first_measures %>% 
  filter(wing > 80 &
           wing < 90)
## # A tibble: 1,086 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17068  GRCA  M      87    93    38.5
##  2 0791-17078  GRCA  M      89    95.5  37.3
##  3 0791-17211  GRCA  F      86    NA    40.8
##  4 0791-17212  GRCA  M      83    NA    34.6
##  5 0791-17214  GRCA  F      89.5  NA    33.3
##  6 0791-17222  GRCA  F      88.5  NA    32.9
##  7 0791-17223  GRCA  M      86    NA    37.9
##  8 0791-17239  GRCA  F      89    NA    43.3
##  9 0791-17248  GRCA  M      88    NA    37.5
## 10 0791-17249  GRCA  F      88    NA    40.3
## # … with 1,076 more rows

… or, even more parsimoniously, simply separate the two filters with a comma:

first_measures %>% 
  filter(wing > 80,
         wing < 90)
## # A tibble: 1,086 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17068  GRCA  M      87    93    38.5
##  2 0791-17078  GRCA  M      89    95.5  37.3
##  3 0791-17211  GRCA  F      86    NA    40.8
##  4 0791-17212  GRCA  M      83    NA    34.6
##  5 0791-17214  GRCA  F      89.5  NA    33.3
##  6 0791-17222  GRCA  F      88.5  NA    32.9
##  7 0791-17223  GRCA  M      86    NA    37.9
##  8 0791-17239  GRCA  F      89    NA    43.3
##  9 0791-17248  GRCA  M      88    NA    37.5
## 10 0791-17249  GRCA  F      88    NA    40.3
## # … with 1,076 more rows

   While less parsimonious, chained filter statements can be less memory-intensive than using & or ,. You won’t recognize a difference until your data are very large (like millions of records).

OR: In the above, each of the conditions of the filter must be met. While that’s often our goal, sometimes we want to filter the data to where either conditions are met.

Let’s see what happens when we try to filter the data to Gray catbird (GRCA) AND Northern cardinal (NOCA):

first_measures %>% 
  filter(spp == 'GRCA',
         spp == 'NOCA')
## # A tibble: 0 × 6
## # … with 6 variables: band_number <chr>, spp <chr>, sex <chr>, wing <dbl>,
## #   tl <dbl>, mass <dbl>

This returns no rows, because no species is both Gray catbird AND Northern cardinal!

To address this, we can use the OR operator, |:

first_measures %>% 
  filter(spp == 'GRCA'|spp == 'NOCA')
## # A tibble: 2,512 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17014  NOCA  M      91.5  96.5  41  
##  2 0791-17018  NOCA  M      99.5 105    45.4
##  3 0791-17019  NOCA  F      91.5  92    41.3
##  4 0791-17020  NOCA  M      90    23.9  41.7
##  5 0791-17025  NOCA  M      91    96    43.2
##  6 0791-17027  NOCA  M      94    NA    43  
##  7 0791-17029  GRCA  M      90.5  99    34.7
##  8 0791-17030  GRCA  M      90    95    33.4
##  9 0791-17031  NOCA  M      95.5 107    43.5
## 10 0791-17032  GRCA  M      93.5 101    37.4
## # … with 2,502 more rows

… or, more parsimoniously, use the %in% operator. This will extract a value that’s within a vector of provided values:

first_measures %>% 
  filter(spp %in% c('GRCA', 'NOCA'))
## # A tibble: 2,512 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17014  NOCA  M      91.5  96.5  41  
##  2 0791-17018  NOCA  M      99.5 105    45.4
##  3 0791-17019  NOCA  F      91.5  92    41.3
##  4 0791-17020  NOCA  M      90    23.9  41.7
##  5 0791-17025  NOCA  M      91    96    43.2
##  6 0791-17027  NOCA  M      94    NA    43  
##  7 0791-17029  GRCA  M      90.5  99    34.7
##  8 0791-17030  GRCA  M      90    95    33.4
##  9 0791-17031  NOCA  M      95.5 107    43.5
## 10 0791-17032  GRCA  M      93.5 101    37.4
## # … with 2,502 more rows

As with the filtering operations above, this can also be used with multiple logical conditions:

first_measures %>% 
  filter(spp %in% c('GRCA', 'NOCA'),
         sex == 'F')
## # A tibble: 785 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17019  NOCA  F      91.5  92    41.3
##  2 0791-17044  NOCA  F      92    95    42.1
##  3 0791-17046  NOCA  F      93    94.5  47.8
##  4 0791-17048  GRCA  F      90.5 102    42.7
##  5 0791-17064  NOCA  F      94    92    42.2
##  6 0791-17079  GRCA  F      90    92.5  38.3
##  7 0791-17211  GRCA  F      86    NA    40.8
##  8 0791-17214  GRCA  F      89.5  NA    33.3
##  9 0791-17215  NOCA  F      92    NA    43.6
## 10 0791-17222  GRCA  F      88.5  NA    32.9
## # … with 775 more rows

Filter multiple variables

You can continue to add filtering statements using a single variable, or multiple variables. For example, we can filter by wing length and subset the data to where the species is Gray catbird (GRCA) with this chained analysis:

first_measures %>% 
  filter(wing > 80) %>% 
  filter(wing < 90) %>% 
  filter(spp == 'GRCA')
## # A tibble: 698 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17068  GRCA  M      87    93    38.5
##  2 0791-17078  GRCA  M      89    95.5  37.3
##  3 0791-17211  GRCA  F      86    NA    40.8
##  4 0791-17212  GRCA  M      83    NA    34.6
##  5 0791-17214  GRCA  F      89.5  NA    33.3
##  6 0791-17222  GRCA  F      88.5  NA    32.9
##  7 0791-17223  GRCA  M      86    NA    37.9
##  8 0791-17239  GRCA  F      89    NA    43.3
##  9 0791-17248  GRCA  M      88    NA    37.5
## 10 0791-17249  GRCA  F      88    NA    40.3
## # … with 688 more rows

… or within a single function (way more parsimonious):

first_measures %>% 
  filter(wing > 80,
         wing < 90,
         spp == 'GRCA')
## # A tibble: 698 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0791-17068  GRCA  M      87    93    38.5
##  2 0791-17078  GRCA  M      89    95.5  37.3
##  3 0791-17211  GRCA  F      86    NA    40.8
##  4 0791-17212  GRCA  M      83    NA    34.6
##  5 0791-17214  GRCA  F      89.5  NA    33.3
##  6 0791-17222  GRCA  F      88.5  NA    32.9
##  7 0791-17223  GRCA  M      86    NA    37.9
##  8 0791-17239  GRCA  F      89    NA    43.3
##  9 0791-17248  GRCA  M      88    NA    37.5
## 10 0791-17249  GRCA  F      88    NA    40.3
## # … with 688 more rows

if_any() and if_all()

If you want to apply the same filter across multiple variables, we can add the functions if_any() or if_all() to our filtering statements.

OR: We can test if one of a selected set of variables satisfies a logical condition with the if_any() function. We supply a vector of variables to test, supply a formula (with the ~ operator, see ?tilde), a filtering statement, and specify the location of the variable in our filtering statement with .x.

In the below, I filter to values where either the wing or tail length (tl) is NA:

first_measures %>% 
  filter(
    if_any(
      wing:tl,
      ~is.na(.x)
    ))
## # A tibble: 4,546 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29174  AMRO  U        NA    NA  NA  
##  2 0742-29175  AMRO  U        NA    NA  NA  
##  3 0742-29176  AMRO  U        NA    NA  NA  
##  4 0742-29177  AMRO  U        NA    NA  NA  
##  5 0742-29178  AMRO  U        NA    NA  NA  
##  6 0742-29179  AMRO  U        NA    NA  NA  
##  7 0742-29420  AMRO  M       126    NA  73.4
##  8 0742-29421  AMRO  M       129    NA  70.2
##  9 0742-29422  AMRO  F       119    NA  81.6
## 10 0742-29423  AMRO  F       122    NA  77.4
## # … with 4,536 more rows

… We can subset the data to rows where neither wing nor tail are values are missing using the logical negation operator:

first_measures %>% 
  filter(
    if_any(
      wing:tl,
      ~!is.na(.x)
    ))
## # A tibble: 6,050 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29420  AMRO  M       126    NA  73.4
##  2 0742-29421  AMRO  M       129    NA  70.2
##  3 0742-29422  AMRO  F       119    NA  81.6
##  4 0742-29423  AMRO  F       122    NA  77.4
##  5 0742-29426  AMRO  U       108    NA  64.4
##  6 0742-29427  AMRO  U       103    NA  48.9
##  7 0742-29428  AMRO  F       115    NA  76.1
##  8 0742-29431  AMRO  F       119    NA  72.4
##  9 0742-29432  AMRO  M       130    NA  71  
## 10 0742-29433  AMRO  M       122    NA  73.5
## # … with 6,040 more rows

AND: We can use if_all() to test whether both values satisfies a logical condition. The construction of the function is equivalent to that of if_any().

Here, we’ll subset to rows where both wing and tail values are missing:

first_measures %>% 
  filter(
    if_all(
      wing:tl,
      ~is.na(.x)
    ))
## # A tibble: 329 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29174  AMRO  U        NA    NA  NA  
##  2 0742-29175  AMRO  U        NA    NA  NA  
##  3 0742-29176  AMRO  U        NA    NA  NA  
##  4 0742-29177  AMRO  U        NA    NA  NA  
##  5 0742-29178  AMRO  U        NA    NA  NA  
##  6 0742-29179  AMRO  U        NA    NA  NA  
##  7 0742-29425  AMRO  U        NA    NA  65.2
##  8 0742-29762  AMRO  U        NA    NA  NA  
##  9 0742-29763  AMRO  U        NA    NA  NA  
## 10 0791-17033  GRCA  M        NA    NA  36.4
## # … with 319 more rows

… and subset to rows where both wing and tail values are not NA:

first_measures %>% 
  filter(
    if_all(
      wing:tl,
      ~!is.na(.x)
    ))
## # A tibble: 1,833 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29438  AMRO  M     130   100    71.4
##  2 0742-29489  AMRO  F     126   103    81.9
##  3 0742-29490  AMRO  M     128   100    73.7
##  4 0742-29702  AMRO  F     117    88    72.8
##  5 0791-17014  NOCA  M      91.5  96.5  41  
##  6 0791-17018  NOCA  M      99.5 105    45.4
##  7 0791-17019  NOCA  F      91.5  92    41.3
##  8 0791-17020  NOCA  M      90    23.9  41.7
##  9 0791-17021  NOMO  F     103   111    54.1
## 10 0791-17025  NOCA  M      91    96    43.2
## # … with 1,823 more rows

   Before applying if_any() or if_all() to your own data, be sure to understand how the functions are subsetting the data. It’s a good idea to test the functions on a small set of dummy data prior to applying it to a full dataset.

Filter by group

We can also apply a filter to grouped data. Recall the group_by() function splits the data into groups and subsequent functions applied to grouped data will be applied to each group separately.

For example, let’s subset the data to the minimum mass for female and male birds:

first_measures %>% 
  filter(sex %in% c('F', 'M')) %>% 
  group_by(sex) %>% 
  filter(
    wing == min(wing, na.rm = TRUE)) 
## # A tibble: 2 × 6
## # Groups:   sex [2]
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 2250-24175  HOWR  M        42    NA  10.5
## 2 2870-37309  HOWR  F        42    42  10.8

We see that the lightest female bird captured was a CACH (Carolina chickadee) and the lighted male bird captured was a HOWR (House wren).

Counting observations in each group: Grouped filters can be used to subset the data to only groups with a certain number of observations. To do this, we use the group size function, n(), to count the number of observations in each group.

Here, we’ll subset the data to only species with 100 or more records:

first_measures %>% 
  group_by(spp) %>% 
  filter(n() >= 100) %>% 
  ungroup()
## # A tibble: 6,207 × 6
##    band_number spp   sex    wing    tl  mass
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>
##  1 0742-29174  AMRO  U        NA    NA  NA  
##  2 0742-29175  AMRO  U        NA    NA  NA  
##  3 0742-29176  AMRO  U        NA    NA  NA  
##  4 0742-29177  AMRO  U        NA    NA  NA  
##  5 0742-29178  AMRO  U        NA    NA  NA  
##  6 0742-29179  AMRO  U        NA    NA  NA  
##  7 0742-29420  AMRO  M       126    NA  73.4
##  8 0742-29421  AMRO  M       129    NA  70.2
##  9 0742-29422  AMRO  F       119    NA  81.6
## 10 0742-29423  AMRO  F       122    NA  77.4
## # … with 6,197 more rows

Important! If we intend to use these data further, it’s necessary to ungroup() the data.

  Now you!

Filter to birds where neither mass nor wing are NA and there are at least 50 individuals measured per species. Assign the object to your global environment with the name measures_subset.

measures_subset <-
  first_measures %>% 
  filter(
    if_all(
      c(wing, mass),
      ~!is.na(.x))) %>% 
  group_by(spp) %>% 
  filter(n() > 100) %>% 
  ungroup()

Add or modify a column

The mutate() family of functions was covered in our previous Custom functions … lesson, but I’m including some brief material on mutation here for completeness.

Modify or add a column with mutate: The mutate() function can be used to modify an existing column or columns.

For example, we can calculate the ratio between the mass and wing of a bird and add it as a new column named mass_wing (*Note: For measures_subset see “Now you!”, above):

measures_subset %>% 
  mutate(mass_wing = mass/wing)
## # A tibble: 5,609 × 7
##    band_number spp   sex    wing    tl  mass mass_wing
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>     <dbl>
##  1 0742-29420  AMRO  M       126    NA  73.4     0.583
##  2 0742-29421  AMRO  M       129    NA  70.2     0.544
##  3 0742-29422  AMRO  F       119    NA  81.6     0.686
##  4 0742-29423  AMRO  F       122    NA  77.4     0.634
##  5 0742-29426  AMRO  U       108    NA  64.4     0.596
##  6 0742-29427  AMRO  U       103    NA  48.9     0.475
##  7 0742-29428  AMRO  F       115    NA  76.1     0.662
##  8 0742-29431  AMRO  F       119    NA  72.4     0.608
##  9 0742-29432  AMRO  M       130    NA  71       0.546
## 10 0742-29433  AMRO  M       122    NA  73.5     0.602
## # … with 5,599 more rows

I should note that adding that variable name is important because we can end up with some ugly (and difficult to use) column names without it:

measures_subset %>% 
  mutate(mass/wing)
## # A tibble: 5,609 × 7
##    band_number spp   sex    wing    tl  mass `mass/wing`
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>       <dbl>
##  1 0742-29420  AMRO  M       126    NA  73.4       0.583
##  2 0742-29421  AMRO  M       129    NA  70.2       0.544
##  3 0742-29422  AMRO  F       119    NA  81.6       0.686
##  4 0742-29423  AMRO  F       122    NA  77.4       0.634
##  5 0742-29426  AMRO  U       108    NA  64.4       0.596
##  6 0742-29427  AMRO  U       103    NA  48.9       0.475
##  7 0742-29428  AMRO  F       115    NA  76.1       0.662
##  8 0742-29431  AMRO  F       119    NA  72.4       0.608
##  9 0742-29432  AMRO  M       130    NA  71         0.546
## 10 0742-29433  AMRO  M       122    NA  73.5       0.602
## # … with 5,599 more rows

It might seem obvious, but mutate() divided the values in the mass column by the values in the wing column. Let’s have a look at what that means though. Both columns are a vector or the same length. In R, when you divide one vector by another, the calculation is conducted by position. You can verify this with the following calculation:

1:3/4:6
## [1] 0.25 0.40 0.50

In the above, the value 1 is divided by 4, 2 is divided by 5, and 3 is divided by 6.

We can also use mutate to add new data to the data frame. The only rule is that the added vector (i.e., column) must be the same length as the data frame.

For example, let’s add an id column, which will basically just be the row number (Note: This doesn’t make a good primary key though!)

measures_subset %>% 
  mutate(id = 1:nrow(.))
## # A tibble: 5,609 × 7
##    band_number spp   sex    wing    tl  mass    id
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl> <int>
##  1 0742-29420  AMRO  M       126    NA  73.4     1
##  2 0742-29421  AMRO  M       129    NA  70.2     2
##  3 0742-29422  AMRO  F       119    NA  81.6     3
##  4 0742-29423  AMRO  F       122    NA  77.4     4
##  5 0742-29426  AMRO  U       108    NA  64.4     5
##  6 0742-29427  AMRO  U       103    NA  48.9     6
##  7 0742-29428  AMRO  F       115    NA  76.1     7
##  8 0742-29431  AMRO  F       119    NA  72.4     8
##  9 0742-29432  AMRO  M       130    NA  71       9
## 10 0742-29433  AMRO  M       122    NA  73.5    10
## # … with 5,599 more rows

I like to use the function row_number() for this task:

measures_subset %>% 
  mutate(id = row_number())
## # A tibble: 5,609 × 7
##    band_number spp   sex    wing    tl  mass    id
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl> <int>
##  1 0742-29420  AMRO  M       126    NA  73.4     1
##  2 0742-29421  AMRO  M       129    NA  70.2     2
##  3 0742-29422  AMRO  F       119    NA  81.6     3
##  4 0742-29423  AMRO  F       122    NA  77.4     4
##  5 0742-29426  AMRO  U       108    NA  64.4     5
##  6 0742-29427  AMRO  U       103    NA  48.9     6
##  7 0742-29428  AMRO  F       115    NA  76.1     7
##  8 0742-29431  AMRO  F       119    NA  72.4     8
##  9 0742-29432  AMRO  M       130    NA  71       9
## 10 0742-29433  AMRO  M       122    NA  73.5    10
## # … with 5,599 more rows

Add & subset columns

The transmute() function works similarly to mutate, but it allows us to add and subset columns in a data frame in one step.

Let’s see what happens when we use transmute with our mass_wing calculation.

measures_subset %>% 
  transmute(mass_wing = mass/wing)
## # A tibble: 5,609 × 1
##    mass_wing
##        <dbl>
##  1     0.583
##  2     0.544
##  3     0.686
##  4     0.634
##  5     0.596
##  6     0.475
##  7     0.662
##  8     0.608
##  9     0.546
## 10     0.602
## # … with 5,599 more rows

This dropped all of the other columns!

What if we wanted to do the equivalent of the below, where we conduct our mutate and then subset the columns to species and mass_wing?

measures_subset %>% 
  mutate(mass_wing = mass/wing) %>% 
  select(spp, mass_wing)
## # A tibble: 5,609 × 2
##    spp   mass_wing
##    <chr>     <dbl>
##  1 AMRO      0.583
##  2 AMRO      0.544
##  3 AMRO      0.686
##  4 AMRO      0.634
##  5 AMRO      0.596
##  6 AMRO      0.475
##  7 AMRO      0.662
##  8 AMRO      0.608
##  9 AMRO      0.546
## 10 AMRO      0.602
## # … with 5,599 more rows

To do this with transmute(), we can simply add spp as an additional column:

measures_subset %>% 
  transmute(
    spp,
    mass_wing = mass/wing)
## # A tibble: 5,609 × 2
##    spp   mass_wing
##    <chr>     <dbl>
##  1 AMRO      0.583
##  2 AMRO      0.544
##  3 AMRO      0.686
##  4 AMRO      0.634
##  5 AMRO      0.596
##  6 AMRO      0.475
##  7 AMRO      0.662
##  8 AMRO      0.608
##  9 AMRO      0.546
## 10 AMRO      0.602
## # … with 5,599 more rows

We can also rename the column in the same step:

measures_subset %>% 
  transmute(
    species = spp,
    mass_wing = mass/wing)
## # A tibble: 5,609 × 2
##    species mass_wing
##    <chr>       <dbl>
##  1 AMRO        0.583
##  2 AMRO        0.544
##  3 AMRO        0.686
##  4 AMRO        0.634
##  5 AMRO        0.596
##  6 AMRO        0.475
##  7 AMRO        0.662
##  8 AMRO        0.608
##  9 AMRO        0.546
## 10 AMRO        0.602
## # … with 5,599 more rows

   The transmute() function is pretty much just a combination of mutate() and select(). The decision of when to use transmute() vs. mutate() should be determined by code parsimony and the ability of your code to communicate the steps taken during data processing.

Grouped mutation

Similar to other functions, mutate() functions can be run on grouped data.

For example, let’s determine whether a given row represents the first observation for a given species:

measures_subset %>% 
  mutate(id = row_number()) %>% 
  group_by(spp) %>% 
  mutate(
    first_obs = id == min(id)) %>% 
  ungroup()
## # A tibble: 5,609 × 8
##    band_number spp   sex    wing    tl  mass    id first_obs
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl> <int> <lgl>    
##  1 0742-29420  AMRO  M       126    NA  73.4     1 TRUE     
##  2 0742-29421  AMRO  M       129    NA  70.2     2 FALSE    
##  3 0742-29422  AMRO  F       119    NA  81.6     3 FALSE    
##  4 0742-29423  AMRO  F       122    NA  77.4     4 FALSE    
##  5 0742-29426  AMRO  U       108    NA  64.4     5 FALSE    
##  6 0742-29427  AMRO  U       103    NA  48.9     6 FALSE    
##  7 0742-29428  AMRO  F       115    NA  76.1     7 FALSE    
##  8 0742-29431  AMRO  F       119    NA  72.4     8 FALSE    
##  9 0742-29432  AMRO  M       130    NA  71       9 FALSE    
## 10 0742-29433  AMRO  M       122    NA  73.5    10 FALSE    
## # … with 5,599 more rows

… we can use this information to subset the data to just the first observation in each group:

measures_subset %>% 
  mutate(
    id = row_number()) %>% 
  group_by(spp) %>% 
  mutate(
    first_obs = id == min(id)) %>% 
  ungroup() %>% 
  filter(first_obs)
## # A tibble: 7 × 8
##   band_number spp   sex    wing    tl  mass    id first_obs
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl> <int> <lgl>    
## 1 0742-29420  AMRO  M     126    NA    73.4     1 TRUE     
## 2 0791-17014  NOCA  M      91.5  96.5  41      49 TRUE     
## 3 0791-17029  GRCA  M      90.5  99    34.7    55 TRUE     
## 4 0820-94571  CACH  F      59.5  49    10.4   155 TRUE     
## 5 1141-05437  CARW  M      62    53    20.3   190 TRUE     
## 6 1141-05444  SOSP  M      66    68    20.9   194 TRUE     
## 7 1821-63303  HOWR  U      99    NA    13     919 TRUE

… though this can be useful in many circumstances, the second mutate statement wasn’t really necessary:

measures_subset %>% 
  mutate(
    id = row_number()) %>% 
  group_by(spp) %>% 
  filter(
    id == min(id))
## # A tibble: 7 × 7
## # Groups:   spp [7]
##   band_number spp   sex    wing    tl  mass    id
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 0742-29420  AMRO  M     126    NA    73.4     1
## 2 0791-17014  NOCA  M      91.5  96.5  41      49
## 3 0791-17029  GRCA  M      90.5  99    34.7    55
## 4 0820-94571  CACH  F      59.5  49    10.4   155
## 5 1141-05437  CARW  M      62    53    20.3   190
## 6 1141-05444  SOSP  M      66    68    20.9   194
## 7 1821-63303  HOWR  U      99    NA    13     919

… nor was the first!

measures_subset %>% 
  group_by(spp) %>% 
  filter(
    row_number() == 
      min(
        row_number()))
## # A tibble: 7 × 6
## # Groups:   spp [7]
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29420  AMRO  M     126    NA    73.4
## 2 0791-17014  NOCA  M      91.5  96.5  41  
## 3 0791-17029  GRCA  M      90.5  99    34.7
## 4 0820-94571  CACH  F      59.5  49    10.4
## 5 1141-05437  CARW  M      62    53    20.3
## 6 1141-05444  SOSP  M      66    68    20.9
## 7 1821-63303  HOWR  U      99    NA    13

… and even that filter was a bit of overkill!

measures_subset %>% 
  group_by(spp) %>% 
  slice_head()
## # A tibble: 7 × 6
## # Groups:   spp [7]
##   band_number spp   sex    wing    tl  mass
##   <chr>       <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29420  AMRO  M     126    NA    73.4
## 2 0820-94571  CACH  F      59.5  49    10.4
## 3 1141-05437  CARW  M      62    53    20.3
## 4 0791-17029  GRCA  M      90.5  99    34.7
## 5 1821-63303  HOWR  U      99    NA    13  
## 6 0791-17014  NOCA  M      91.5  96.5  41  
## 7 1141-05444  SOSP  M      66    68    20.9

Summarize

The summarize() function allows you to calculate summary statistics for a grouped dataset.

Let’s run the below and think about what occurred:

measures_subset %>% 
  group_by(spp) %>% 
  summarize(
    mean_wing = mean(wing, na.rm = TRUE))
## # A tibble: 7 × 2
##   spp   mean_wing
##   <chr>     <dbl>
## 1 AMRO      124. 
## 2 CACH       60.6
## 3 CARW       58.8
## 4 GRCA       88.7
## 5 HOWR       49.8
## 6 NOCA       90.8
## 7 SOSP       65.2

The above provided the average wing length by species … but what is summarize doing?

We could have used mutate to calculate the mean wing length by species:

measures_subset %>% 
  group_by(spp) %>% 
  mutate(
    mean_wing = mean(wing, na.rm = TRUE))
## # A tibble: 5,609 × 7
## # Groups:   spp [7]
##    band_number spp   sex    wing    tl  mass mean_wing
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>     <dbl>
##  1 0742-29420  AMRO  M       126    NA  73.4      124.
##  2 0742-29421  AMRO  M       129    NA  70.2      124.
##  3 0742-29422  AMRO  F       119    NA  81.6      124.
##  4 0742-29423  AMRO  F       122    NA  77.4      124.
##  5 0742-29426  AMRO  U       108    NA  64.4      124.
##  6 0742-29427  AMRO  U       103    NA  48.9      124.
##  7 0742-29428  AMRO  F       115    NA  76.1      124.
##  8 0742-29431  AMRO  F       119    NA  72.4      124.
##  9 0742-29432  AMRO  M       130    NA  71        124.
## 10 0742-29433  AMRO  M       122    NA  73.5      124.
## # … with 5,599 more rows

… then ungrouped the data:

measures_subset %>% 
  group_by(spp) %>% 
  mutate(
    mean_wing = mean(wing, na.rm = TRUE)) %>% 
  ungroup()
## # A tibble: 5,609 × 7
##    band_number spp   sex    wing    tl  mass mean_wing
##    <chr>       <chr> <chr> <dbl> <dbl> <dbl>     <dbl>
##  1 0742-29420  AMRO  M       126    NA  73.4      124.
##  2 0742-29421  AMRO  M       129    NA  70.2      124.
##  3 0742-29422  AMRO  F       119    NA  81.6      124.
##  4 0742-29423  AMRO  F       122    NA  77.4      124.
##  5 0742-29426  AMRO  U       108    NA  64.4      124.
##  6 0742-29427  AMRO  U       103    NA  48.9      124.
##  7 0742-29428  AMRO  F       115    NA  76.1      124.
##  8 0742-29431  AMRO  F       119    NA  72.4      124.
##  9 0742-29432  AMRO  M       130    NA  71        124.
## 10 0742-29433  AMRO  M       122    NA  73.5      124.
## # … with 5,599 more rows

… then selected our columns of interest with select():

measures_subset %>% 
  group_by(spp) %>% 
  mutate(
    mean_wing = mean(wing, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(spp, mean_wing)
## # A tibble: 5,609 × 2
##    spp   mean_wing
##    <chr>     <dbl>
##  1 AMRO       124.
##  2 AMRO       124.
##  3 AMRO       124.
##  4 AMRO       124.
##  5 AMRO       124.
##  6 AMRO       124.
##  7 AMRO       124.
##  8 AMRO       124.
##  9 AMRO       124.
## 10 AMRO       124.
## # … with 5,599 more rows

… then subset the data to unique rows with distinct():

measures_subset %>% 
  group_by(spp) %>% 
  mutate(
    mean_wing = mean(wing, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(spp, mean_wing) %>% 
  distinct()
## # A tibble: 7 × 2
##   spp   mean_wing
##   <chr>     <dbl>
## 1 AMRO      124. 
## 2 NOCA       90.8
## 3 GRCA       88.7
## 4 CACH       60.6
## 5 CARW       58.8
## 6 SOSP       65.2
## 7 HOWR       49.8

But that’s a lot of coding … summarize saves us loads of time and really enhances code parsimony!

Multiple groups

I often use the summarize() function, in combination with n() for calculating the number of observations per group:

measures_subset %>% 
  group_by(spp) %>% 
  summarize(n = n())
## # A tibble: 7 × 2
##   spp       n
##   <chr> <int>
## 1 AMRO    484
## 2 CACH   1006
## 3 CARW    610
## 4 GRCA   1157
## 5 HOWR    748
## 6 NOCA   1172
## 7 SOSP    432

What if we wanted to summarize the number of observations by multiple groups though? We simply separate additional groups in group_by() by a comma. In doing so, any function (e.g., split, filter, mutate) will run on each group separately.

In the code below, I summarize the number of observations by species and sex:

measures_subset %>% 
  group_by(spp, sex) %>% 
  summarize(n = n())
## `summarise()` has grouped output by 'spp'. You can override using the `.groups`
## argument.
## # A tibble: 28 × 3
## # Groups:   spp [7]
##    spp   sex       n
##    <chr> <chr> <int>
##  1 AMRO  F       201
##  2 AMRO  M       199
##  3 AMRO  U        69
##  4 AMRO  <NA>     15
##  5 CACH  F       228
##  6 CACH  M       199
##  7 CACH  U       532
##  8 CACH  <NA>     47
##  9 CARW  F       174
## 10 CARW  M       151
## # … with 18 more rows

But notice the message we received … the data are still grouped. By default, summarize() will drop the last group used. If we summarize based on a single group, we don’t have to think about it. When we summarize based on multiple groups, however, we need to specify .groups = 'drop' to remove the grouping structure from the data:

measures_subset %>% 
  group_by(spp, sex) %>% 
  summarize(n = n(),
            .groups = 'drop')
## # A tibble: 28 × 3
##    spp   sex       n
##    <chr> <chr> <int>
##  1 AMRO  F       201
##  2 AMRO  M       199
##  3 AMRO  U        69
##  4 AMRO  <NA>     15
##  5 CACH  F       228
##  6 CACH  M       199
##  7 CACH  U       532
##  8 CACH  <NA>     47
##  9 CARW  F       174
## 10 CARW  M       151
## # … with 18 more rows
  Now you!

Modify the above such that the NA values in sex are removed and the resultant object is a four column tibble with the names spp, F (female), M (male), and U (unknown).

measures_subset %>% 
  filter(!is.na(sex)) %>% 
  group_by(spp, sex) %>% 
  summarize(n = n(),
            .groups = 'drop') %>% 
  pivot_wider(
    names_from = sex,
    values_from = n)
## # A tibble: 7 × 4
##   spp       F     M     U
##   <chr> <int> <int> <int>
## 1 AMRO    201   199    69
## 2 CACH    228   199   532
## 3 CARW    174   151   246
## 4 GRCA    333   590   185
## 5 HOWR    199   424   104
## 6 NOCA    395   505   215
## 7 SOSP     76   288    47

Join tables

Once data are tidy they can often be difficult to use unless we are able to join information across tables.

As I’ve stated throughout this course, a key is used to reference a set of values. When we store data in our global environment, the name of the data object is actually a key that R (and you) uses to retrieve the data. If you overwrite a key, R can no longer retrieve the data.

With tidy tabular data, two types of keys are of particular interest: primary and foreign keys. A primary key is a column that provides a unique identifier for each row of a dataset. A foreign key is a column that provides a reference to the primary key of another dataset. When we conduct a join, we are joining tables based on these two sets of keys.

Below, we see how tables in a relational database are interconnected. Each connection is defined by primary and foreign keys (from Wickham and Grolemund):

For this exercise, we’ll use the tibbles visits, counts, and birds (all of which are in your global environment).

Each row in the visits table represents the date that a site was visited. It contains a primary key, visit_id. Let’s take a look:

visits
## # A tibble: 1,394 × 3
##    visit_id        site_id date      
##    <chr>           <chr>   <chr>     
##  1 49sd69T3908d480 4F825x0 2000-04-28
##  2 N4110Z739160380 P32Ka16 2000-05-08
##  3 9000637391b4980 68yQ999 2000-05-12
##  4 22mF4Si39233280 O030IH1 2000-05-18
##  5 771vBqK39233280 P32Ka16 2000-05-18
##  6 Sa2Y688392dbe80 K0w2408 2000-05-26
##  7 L6Nwh253935a780 T541U66 2000-06-01
##  8 948U0i23936f900 F6H60d3 2000-06-02
##  9 2SnXV013936f900 09Z5mxk 2000-06-02
## 10 K6240j539384a80 6F1w870 2000-06-03
## # … with 1,384 more rows

The counts table, which we’ve use previously, represents the counts of birds, by species, during a visit. This table contains a primary key, count_id, a foreign key to the visits table, visit_id, and a foreign key to the birds table, spp.

counts
## # A tibble: 2,640 × 4
##    count_id visit_id        spp   count
##    <chr>    <chr>           <chr> <dbl>
##  1 Nrpl13   3Fxh65U4fb43f80 AMRO      2
##  2 m9U585   3Fxh65U4fb43f80 GRCA      2
##  3 0Wlz34   3Fxh65U4fb43f80 MODO      1
##  4 WNf927   3Fxh65U4fb43f80 NOCA      2
##  5 606emZ   3Fxh65U4fb43f80 RBWO      1
##  6 47f20A   3Fxh65U4fb43f80 SOSP      1
##  7 p24140   3Fxh65U4fb43f80 WBNU      1
##  8 696vx5   8iXAa9851afd100 AMRO      1
##  9 WrF933   8iXAa9851afd100 CACH      1
## 10 tj6Q6m   8iXAa9851afd100 CARW      1
## # … with 2,630 more rows

The birds table contains natural history information for each of the birds observed. The primary key of this table is species and there are no foreign keys.

birds
## # A tibble: 38 × 4
##    species common_name          foraging diet       
##    <chr>   <chr>                <chr>    <chr>      
##  1 AMCR    American Crow        ground   omnivore   
##  2 AMGO    American Goldfinch   foliage  granivore  
##  3 AMRO    American Robin       ground   omnivore   
##  4 BAOR    Baltimore Oriole     foliage  insectivore
##  5 BHCO    Brown-headed Cowbird ground   omnivore   
##  6 BLJA    Blue Jay             ground   omnivore   
##  7 CACH    Carolina Chickadee   foliage  insectivore
##  8 CARW    Carolina Wren        ground   insectivore
##  9 CEDW    Cedar Waxwing        foliage  omnivore   
## 10 CHSP    Chipping Sparrow     ground   omnivore   
## # … with 28 more rows

Join matching rows

Joins connect two tables, joining information based on matching key values.

Left joins The join that I use most often is a left_join(). Optimally (but not always), a left_join() is conducted by joining the table with the foreign key to the table with the primary key, the target table. In a left join, all records in the target table are maintained and only matching records from the other table are maintained.

Let’s join counts (primary key: count_id, foreign key: spp) to birds (primary key: birds). We specify the name of the target table, then the name of the table we are joining, then the keys for the join with the argument by =. Notice that the keys don’t share a name, so we have to supply the key to the target and joining tables as a vector.

birds %>% 
  left_join(
    counts, 
    by = c('species' = 'spp'))
## # A tibble: 2,641 × 7
##    species common_name   foraging diet     count_id visit_id        count
##    <chr>   <chr>         <chr>    <chr>    <chr>    <chr>           <dbl>
##  1 AMCR    American Crow ground   omnivore k96300   150jN34577eed00     1
##  2 AMCR    American Crow ground   omnivore 673Cyd   G686H0051edc780     2
##  3 AMCR    American Crow ground   omnivore C04qxA   q1fkjW85b15d280     1
##  4 AMCR    American Crow ground   omnivore uD255E   c3X17E65733c780     1
##  5 AMCR    American Crow ground   omnivore 56D6j8   7FoMJ055aef9700     2
##  6 AMCR    American Crow ground   omnivore Q20bmT   3r1759651b27400     1
##  7 AMCR    American Crow ground   omnivore abrJHZ   ak357F35ed44500     2
##  8 AMCR    American Crow ground   omnivore 5U846a   9Y5P5465761ec00     1
##  9 AMCR    American Crow ground   omnivore 81z4W4   7q4F5J3573a5f00     2
## 10 AMCR    American Crow ground   omnivore B7F819   036560E5941ce00     1
## # … with 2,631 more rows

What if we were only interested in the common_name variable in the birds table? We can use select to subset the table to the variables of interest (optimally before the join):

birds %>% 
  select(species, common_name) %>% 
  left_join(
    counts, 
    by = c('species' = 'spp'))
## # A tibble: 2,641 × 5
##    species common_name   count_id visit_id        count
##    <chr>   <chr>         <chr>    <chr>           <dbl>
##  1 AMCR    American Crow k96300   150jN34577eed00     1
##  2 AMCR    American Crow 673Cyd   G686H0051edc780     2
##  3 AMCR    American Crow C04qxA   q1fkjW85b15d280     1
##  4 AMCR    American Crow uD255E   c3X17E65733c780     1
##  5 AMCR    American Crow 56D6j8   7FoMJ055aef9700     2
##  6 AMCR    American Crow Q20bmT   3r1759651b27400     1
##  7 AMCR    American Crow abrJHZ   ak357F35ed44500     2
##  8 AMCR    American Crow 5U846a   9Y5P5465761ec00     1
##  9 AMCR    American Crow 81z4W4   7q4F5J3573a5f00     2
## 10 AMCR    American Crow B7F819   036560E5941ce00     1
## # … with 2,631 more rows

If we no longer have a need for the key column after the join, we can remove it with a select argument after the join:

birds %>% 
  select(species, common_name) %>% 
  left_join(
    counts, 
    by = c('species' = 'spp')) %>% 
  select(!species)
## # A tibble: 2,641 × 4
##    common_name   count_id visit_id        count
##    <chr>         <chr>    <chr>           <dbl>
##  1 American Crow k96300   150jN34577eed00     1
##  2 American Crow 673Cyd   G686H0051edc780     2
##  3 American Crow C04qxA   q1fkjW85b15d280     1
##  4 American Crow uD255E   c3X17E65733c780     1
##  5 American Crow 56D6j8   7FoMJ055aef9700     2
##  6 American Crow Q20bmT   3r1759651b27400     1
##  7 American Crow abrJHZ   ak357F35ed44500     2
##  8 American Crow 5U846a   9Y5P5465761ec00     1
##  9 American Crow 81z4W4   7q4F5J3573a5f00     2
## 10 American Crow B7F819   036560E5941ce00     1
## # … with 2,631 more rows

What if we don’t need the count_id field? Because of my obsession with code parsimony, you might think that I’d remove it in the last select statement. Instead, I remove it before joining the two data sets.

birds %>% 
  select(species, common_name) %>% 
  left_join(
    counts %>% 
      select(!count_id), 
    by = c('species' = 'spp')) %>% 
  select(!species)
## # A tibble: 2,641 × 3
##    common_name   visit_id        count
##    <chr>         <chr>           <dbl>
##  1 American Crow 150jN34577eed00     1
##  2 American Crow G686H0051edc780     2
##  3 American Crow q1fkjW85b15d280     1
##  4 American Crow c3X17E65733c780     1
##  5 American Crow 7FoMJ055aef9700     2
##  6 American Crow 3r1759651b27400     1
##  7 American Crow ak357F35ed44500     2
##  8 American Crow 9Y5P5465761ec00     1
##  9 American Crow 7q4F5J3573a5f00     2
## 10 American Crow 036560E5941ce00     1
## # … with 2,631 more rows

This order of operations allows you to step through the code and have a solid sense of the data being modified at each step.

Let’s now join our counts table (foreign key: visit_id) to the visits table (primary key: visit_id).

visits %>% 
  left_join(
    counts,
    by = 'visit_id')
## # A tibble: 3,718 × 6
##    visit_id        site_id date       count_id spp   count
##    <chr>           <chr>   <chr>      <chr>    <chr> <dbl>
##  1 49sd69T3908d480 4F825x0 2000-04-28 <NA>     <NA>     NA
##  2 N4110Z739160380 P32Ka16 2000-05-08 <NA>     <NA>     NA
##  3 9000637391b4980 68yQ999 2000-05-12 <NA>     <NA>     NA
##  4 22mF4Si39233280 O030IH1 2000-05-18 <NA>     <NA>     NA
##  5 771vBqK39233280 P32Ka16 2000-05-18 <NA>     <NA>     NA
##  6 Sa2Y688392dbe80 K0w2408 2000-05-26 <NA>     <NA>     NA
##  7 L6Nwh253935a780 T541U66 2000-06-01 <NA>     <NA>     NA
##  8 948U0i23936f900 F6H60d3 2000-06-02 <NA>     <NA>     NA
##  9 2SnXV013936f900 09Z5mxk 2000-06-02 <NA>     <NA>     NA
## 10 K6240j539384a80 6F1w870 2000-06-03 <NA>     <NA>     NA
## # … with 3,708 more rows

This didn’t work as well as I’d hoped. The left_join included visits where no point counts occurred – and there were apparently a lot of them.

Inner joins: We can join only matching keys using an inner_join(). Only key values that are present in the target and the joining tables are maintained:

visits %>% 
  inner_join(
    counts,
    by = 'visit_id')
## # A tibble: 2,640 × 6
##    visit_id        site_id date       count_id spp   count
##    <chr>           <chr>   <chr>      <chr>    <chr> <dbl>
##  1 43929684fa31c00 061Ud74 2012-05-04 iqo47Z   AMRO      1
##  2 43929684fa31c00 061Ud74 2012-05-04 1Nit1v   BHCO      1
##  3 43929684fa31c00 061Ud74 2012-05-04 169DZQ   COGR      3
##  4 43929684fa31c00 061Ud74 2012-05-04 2l7d7X   GRCA      1
##  5 43929684fa31c00 061Ud74 2012-05-04 r0B33O   HAWO      1
##  6 43929684fa31c00 061Ud74 2012-05-04 974f13   HOWR      1
##  7 43929684fa31c00 061Ud74 2012-05-04 1p307m   NOCA      2
##  8 43929684fa31c00 061Ud74 2012-05-04 w032J0   RBWO      2
##  9 43929684fa31c00 061Ud74 2012-05-04 nI4g86   SOSP      2
## 10 7vAF6bg4fab0500 3M87363 2012-05-10 985k62   AMRO      2
## # … with 2,630 more rows

I use inner joins often but, if you care about the values that aren’t shared across tables, and aren’t careful in exploring your data, inner joins can sometimes lead to data loss.

Join all rows

Full joins: A full join will maintain all records from both tables, regardless of whether the keys match. Non-matching values are coded as NA.

To illustrate this, I’ll join counts from a single visit to the birds table (and select just a few of the columns to make the table more legible):

birds %>% 
  select(species, common_name) %>% 
  full_join(
    counts %>% 
      filter(visit_id == '2h08H4660c93f00'), 
    by = c('species' = 'spp'))
## # A tibble: 38 × 5
##    species common_name          count_id visit_id        count
##    <chr>   <chr>                <chr>    <chr>           <dbl>
##  1 AMCR    American Crow        <NA>     <NA>               NA
##  2 AMGO    American Goldfinch   <NA>     <NA>               NA
##  3 AMRO    American Robin       <NA>     <NA>               NA
##  4 BAOR    Baltimore Oriole     <NA>     <NA>               NA
##  5 BHCO    Brown-headed Cowbird <NA>     <NA>               NA
##  6 BLJA    Blue Jay             o41359   2h08H4660c93f00     2
##  7 CACH    Carolina Chickadee   <NA>     <NA>               NA
##  8 CARW    Carolina Wren        4Z17jG   2h08H4660c93f00     1
##  9 CEDW    Cedar Waxwing        <NA>     <NA>               NA
## 10 CHSP    Chipping Sparrow     <NA>     <NA>               NA
## # … with 28 more rows

This can be very useful when we’re looking to maintain those NA values!

Filtering joins

We use joins to filter data based on matching (or non-matching) key-values. These are called filtering joins. In a filtering join, no columns are added to the target data frame.

Semi-joins: Semi-joins subset data in the target table by the key values that are present in the joining table.

Let’s filter the count data to counts of species observed with a insectivorous diet (diet == insectivore).

counts %>% 
  semi_join(
    birds %>% 
      filter(diet == 'insectivore'),
    by = c('spp' = 'species'))
## # A tibble: 938 × 4
##    count_id visit_id        spp   count
##    <chr>    <chr>           <chr> <dbl>
##  1 p24140   3Fxh65U4fb43f80 WBNU      1
##  2 WrF933   8iXAa9851afd100 CACH      1
##  3 tj6Q6m   8iXAa9851afd100 CARW      1
##  4 M8r390   8iXAa9851afd100 EAPH      1
##  5 e3K50d   8iXAa9851afd100 TUTI      1
##  6 3Q3244   150jN34577eed00 CARW      2
##  7 V12I4K   150jN34577eed00 DOWO      1
##  8 hG9X4o   150jN34577eed00 TUTI      4
##  9 t528Wu   0n075TL4ff23600 CARW      1
## 10 1071D1   0n075TL4ff23600 HOWR      1
## # … with 928 more rows

Anti-joins: Semi-joins subset data in the target table by the key values that are not present in the joining table.

For example, we can subset the count data to counts of species observed who do not have an insectivorous diet:

counts %>% 
  anti_join(
    birds %>% 
      filter(diet == 'insectivore'),
    by = c('spp' = 'species'))
## # A tibble: 1,702 × 4
##    count_id visit_id        spp   count
##    <chr>    <chr>           <chr> <dbl>
##  1 Nrpl13   3Fxh65U4fb43f80 AMRO      2
##  2 m9U585   3Fxh65U4fb43f80 GRCA      2
##  3 0Wlz34   3Fxh65U4fb43f80 MODO      1
##  4 WNf927   3Fxh65U4fb43f80 NOCA      2
##  5 606emZ   3Fxh65U4fb43f80 RBWO      1
##  6 47f20A   3Fxh65U4fb43f80 SOSP      1
##  7 696vx5   8iXAa9851afd100 AMRO      1
##  8 B4o8a2   8iXAa9851afd100 GRCA      1
##  9 P58660   8iXAa9851afd100 MODO      1
## 10 0411N0   8iXAa9851afd100 NOCA      4
## # … with 1,692 more rows

To better understand joins, I strongly recommend reading the R for Data Science chapter (link) on joining data. It’s one of the most important skills for working with data!